Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: do not include uncommitted changes in the sdist #587

Merged
merged 4 commits into from
May 2, 2024

Conversation

dnicolodi
Copy link
Member

Including uncommitted changes in the sdist was implemented in #58 after the discussion in #53. However, the current behavior for which uncommitted changes to files under version control are included but not other files, is an hard to justify surprising half measure.

After careful analysis, it has been determined that none of the use cases for this feature is still valid. Removing it makes the implementation easier, and the behavior less surprising and easier to document an explain.

@rgommers
Copy link
Contributor

Thanks. Quick comment: could you update the sdist docs in docs/explanations/design-old.rst for this change, and mention add_dist_script there?

@dnicolodi
Copy link
Member Author

design-old.rst is not linked anywhere, thus it is not published. I thought we keep it only as base for some future documentation work. This is why I haven't updated it. I'm also working on some documentation updates, but the job got bigger than anticipated and it is not finished yet.

@rgommers
Copy link
Contributor

rgommers commented Mar 1, 2024

design-old.rst is not linked anywhere, thus it is not published. I thought we keep it only as base for some future documentation work

You're right. My bad, that's what I get for trying to squeeze in a quick comment before signing off. I saw the removal of the add_dist_script comment, then looked for a mention in the docs, didn't find it, then looked for .gitattributes. Our only mention of that is in design-old.rst. Looks like we should revive some of that content and link to https://mesonbuild.com/Creating-releases.html

@dnicolodi
Copy link
Member Author

Should we merge this and leave the documentation updates to a separate PR?

@rgommers
Copy link
Contributor

rgommers commented Mar 4, 2024

I'm okay with doing the docs in a follow-up. This is ready to go, right? If so, I need to do some testing with numpy, scipy, & co.

@dnicolodi
Copy link
Member Author

Yes, it is ready to go. The documentation updates are coming soon.

@dnicolodi dnicolodi added the enhancement New feature or request label Mar 5, 2024
@QuLogic
Copy link
Member

QuLogic commented Mar 23, 2024

Some typos: s/changer/changes/ in the PR title and commit summary, and s/an hard/a hard/ and s/document an explain/document and explain/ in the description.

@rgommers rgommers added this to the v0.17.0 milestone Apr 17, 2024
@rgommers rgommers changed the title ENH: do not include uncommitted changer in the sdist ENH: do not include uncommitted changes in the sdist Apr 18, 2024
@rgommers
Copy link
Contributor

@dnicolodi I'd like to merge this for the next release after doing some final testing (likely tomorrow). Would you mind resolving the conflict and fixing the commit message typos pointed out by @QuLogic?

@dnicolodi
Copy link
Member Author

@rgommers Sure, done.

Copy link
Contributor

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping the "uncommitted changes" feature looks fine to me. It is indeed no longer needed in numpy/scipy, and the planned pip change to install via sdist (which was a worry 3 years ago) is permanently shelved/rejected now AFAIK.

The change to the test looks good. There are a few other changes in behavior (file permissions and ownership) in the current version of this PR that you didn't touch on in the PR description and may be unintentional.

docs/explanations/design-old.rst Outdated Show resolved Hide resolved
docs/explanations/design-old.rst Outdated Show resolved Hide resolved
docs/explanations/design-old.rst Outdated Show resolved Hide resolved
file_stat = os.stat(path)
info.mtime = member.mtime
info.size = file_stat.st_size
info.mode = int(oct(file_stat.st_mode)[-3:], 8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are meaningful and not strictly related to uncommitted changes to files.

Example with SciPy:

$ ls -l tools/lint.*  # local git repo
-rwxr-xr-x 1 rgommers rgommers 4318 19 apr 12:16 ../tools/lint.py
-rw-r--r-- 1 rgommers rgommers 1180 19 apr 12:16 ../tools/lint.toml

Compared to what I see in the sdist (screenshots because extracting files changes ownership):

with 0.16.0:
image

with this PR:
image

Note that the file permissions changed compared to what's in the git repo, and ownership now includes my own username. The former I'm not sure about either way, that is behavior inherited from meson dist it looks like. The latter seems undesirable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is unintended, in the sense that I trustedmeson dist (and thus in turn git archive) to be doing the right thing. I'm also not sure whether the previous behavior was deliberate: it seems that files generated via dist scripts do not go through the same metadata mangling, and the metadata mangling seems more an accidental omission than a normalization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For file ownership, I checked git archive --format=tar.gz HEAD > tmp.tar.gz and it sets the ownership to root. So I guess the repacking changes it to local-username. Either way, the current behavior seems preferred, since it gives the same result when run on multiple machines.

Re file permissions: there are config settings for it, e.g. tar.umask in https://git-scm.com/docs/git-archive#_configuration. What git-archive does by default is probably good for reproducibility, and seems to be by design.

Having a peek in meson/mdist.py, it doesn't pass any options like --owner=0/--group=0, --no-same-owner, --no-same-permissions to tar. Not sure if that just never came up, or was rejected before. I can't find anything related in the Meson issue tracker so quickly.

I'm also not sure whether the previous behavior was deliberate: it seems that files generated via dist scripts do not go through the same metadata mangling

I wouldn't read too much into that, since this code was written pretty early in the project's history, and I don't think I knew about add_dist_script at the time; use of add_dist_script probably hadn't come up at all yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only spent a few minutes investigating the thing, and it seems that the archive metadata changes happen in Meson. I tend to think that it is a Meson bug. Unless there are subtleties I am missing, meson-python just copies over the archive members as is. The Meson behavior needs to be fixed, but in the meantime we can fix the metadata in meson-python.

What git-archive does by default is probably good for reproducibility, and seems to be by design.

What does git archive do, other than setting the owner to root:root?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One funny thing, is that git archive, Meson, and meson-python seem to use three different tar formats.

It would be nice if there were a way to avoid packing the files up three times, but I'm not sure there is one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if there were a way to avoid packing the files up three times, but I'm not sure there is one.

git archive and meson dist are trying to do fundamentally different things as a result of dist scripts existing. It will never be possible to reuse git's output directly (at least, impossible except as a micro-optimization for a subset of cases that always excludes the creation of python dist tarballs with sdist metadata.

It would be possible to uplift the meson-python specific changes into meson dist, of course. At the most basic level, it's "just" a kind of domain specific dist script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that reproducibility is not necessarily a requirement of meson dist

It would be nice for meson dist to be reproducible if the dist scripts are reproducible, following a bit more closely what git archive does. Using the correct file metadata in the archive does not seem difficult to achieve, and it is at least a step in the right direction.

Meson uses shutil.make_archive, which isn't well known for it's high degree of configurability

shutil.make_archive() has owner and group argument, that seem to fit at least part of the requirement.

For avoiding packing the files three times, I was more thinking about not having the lower layer pack them at all and have the upper layer do the packing in the way it likes, more than reusing the tarball generated by the lower layer. meson-python just needs to add a file to the archive, and this is possible without unpacking and repacking. The reason why meson-python nedds to unpack and repack the tarball is to fix the rood directory in the arvhive: the project name and version seen by Meson and meson-python are not necessarily the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seem to fit at least part of the requirement.

"Part", eh. :D

Granted this is something that could be properly changed by giving up shutil.make_archive as a bad idea and rolling tarfile/zipfile manually.

The reason why meson-python nedds to unpack and repack the tarball is to fix the rood directory in the arvhive: the project name and version seen by Meson and meson-python are not necessarily the same.

... although they probably should be, and for dynamic version it will be required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although they probably should be, and for dynamic version it will be required.

I'm not sure they should. For project that only want to produce a Python packages, specifying the version only in pyproject.toml seems a perfectly valid thing to do, and for project that are more than a Python package, it seems reasonable to have the Python part have a different name than the whole. Introducing a restriction for them to be the same does not seem worth just to make it a bit easier to generate the sdist and it would be easier only if we fix meson dist to emit the right archive member metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying it should be restricted, just that in a very large number of common cases they will be the same and won't need tweaking.

It is, at least, a useful micro optimization to check for (assuming no other issues).

As for versions, you can provide it in either or. But if you provide it in meson.build you can handle git versioning as well as using the version as a replacement string to insert into e.g. _version.py, so it is my (entirely personal) opinion that people should want to declare it as dynamic, because it is simply superior.

@dnicolodi dnicolodi force-pushed the sdist-strict branch 3 times, most recently from 6f48e18 to 638caa3 Compare April 23, 2024 10:56
@dnicolodi
Copy link
Member Author

@rgommers I fixed the ownership metadata in the sdist archive to be root:root (not only to use the 0 uid and gid as it was implicitly done before). This follows what git archive does. I left the permissions unchanged: we use the same permissions exported by git archive. It seems a valid choice to me. I added another commit that makes the sdist byte-for-byte reproducible in most cases, see the longish comments in the code. I think it is a worthwhile goal and it relatively easy to achieve. Finally, I promoted and expanded the little documentation about sdist we had in the unpublished document to its own section in the manual. Please have a look.

@dnicolodi dnicolodi added the documentation Improvements or additions to documentation label Apr 23, 2024
@dnicolodi dnicolodi force-pushed the sdist-strict branch 2 times, most recently from f0cdb9b to 4503aec Compare April 23, 2024 17:27
@rgommers
Copy link
Contributor

That sounds great! Thanks for improving the docs too. I'll try and revisit this in the morning.

Copy link
Contributor

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a new issue with this latest change. Trying this on NumPy, I get a split tree due to different package names and versions:

$ tree -L 2 .
.
├── numpy-2.1.0.dev0
│   ├── azure-pipelines.yml
...
│   ├── tools
│   └── vendored-meson
└── NumPy-2.1.0.dev0+git20240424.f8392ce
    ├── doc
    ├── numpy
    └── vendored-meson

The project name is a difference in capitalization (easy to change, but still this shouldn't happen). The git version being appended in the project() call in meson.build is something I'm not a big fan of, but it's what people do. It should not affect the sdist contents, only the version in pyproject.toml should be used.

@dnicolodi
Copy link
Member Author

I'm very surprised that this happens, especially by the path that two different paths are used. I don't see how that can happen. Let me see if I can reproduce it.

@dnicolodi
Copy link
Member Author

dnicolodi commented Apr 24, 2024

Actually, we have tests where the project name passed to Meson's project() and the Python distribution name differ, and these work just fine. I really don't see how what you see could happen. Also, the code that adjusts the paths didn't change compared to what you tested previously. Are you sure that the test is correct?

@dnicolodi
Copy link
Member Author

I can reproduce it building an sdist for numpy. But I really cannot get how it can happen.

@dnicolodi
Copy link
Member Author

All the paths that are not correctly rewritten are longer than 100 characters. It looks like a Python bug to me. Investigating further.

@dnicolodi
Copy link
Member Author

Yup. I'm pretty sure it is a Python tarfile module bug: when the name filed in the TarInfo object is set, and the object has an associated extended pax header, the path property in the pax extended header is not reset. If the new name is shorter than what requires the path to be stored in the extended pax header, the path in the pax header is not updated when the archive member is serialized. When the header is read back, the path in the extended path header is read, instead of the shorter one in the regular header. The work around is simple. I'll implement it with a test tomorrow.

@dnicolodi
Copy link
Member Author

Issue fixed and test added. This was a funny one to track down.

@dnicolodi
Copy link
Member Author

The codecov CI job is stuck and I didn't find a way to kill it.

@rgommers
Copy link
Contributor

rgommers commented May 1, 2024

@dnicolodi it looks like you didn't actually push your fix?

@dnicolodi
Copy link
Member Author

Indeed. Pushed now. I think I was waiting to file a bug for CPython to be able to reference it in the code comment and commit message, but I haven't submitted it yet. I don't think it is important to reference it here, and I can add a reference later. Pushed now.

Copy link
Contributor

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retested, everything looks good to go now, modulo a few typos. Thanks @dnicolodi!

mesonpy/__init__.py Outdated Show resolved Hide resolved
mesonpy/__init__.py Outdated Show resolved Hide resolved
mesonpy/__init__.py Outdated Show resolved Hide resolved
Including uncommitted changes in the sdist was implemented in mesonbuild#58
after the discussion in mesonbuild#53. However, the current behavior for which
uncommitted changes to files under version control are included but
not other files, is a hard to justify surprising half measure.

After careful analysis, it has been determined that none of the use
cases for this feature is still valid. Removing it makes the
implementation easier, and the behavior less surprising and easier to
document and explain.

While at it, modernize the associated test.
It is not relevant for the users and takes precious space in the
documentation index, which is getting quite long.
@dnicolodi dnicolodi merged commit 09cfd8e into mesonbuild:main May 2, 2024
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants